{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Processing Dirty Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Background \n",
    "\n",
    "This is fake data generated to demonstrate the capabilities of `pyjanitor`.  It contains a bunch of common problems that we regularly encounter when working with data.  Let's go fix it!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Packages\n",
    "\n",
    "Importing `pyjanitor` is all that's needed to give Pandas Dataframes extra methods to work with your data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>First Name</th>\n",
       "      <th>Last Name</th>\n",
       "      <th>Employee Status</th>\n",
       "      <th>Subject</th>\n",
       "      <th>Hire Date</th>\n",
       "      <th>% Allocated</th>\n",
       "      <th>Full time?</th>\n",
       "      <th>do not edit! ---&gt;</th>\n",
       "      <th>Certification</th>\n",
       "      <th>Certification.1</th>\n",
       "      <th>Certification.2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jason</td>\n",
       "      <td>Bourne</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>PE</td>\n",
       "      <td>39690.0</td>\n",
       "      <td>0.75</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jason</td>\n",
       "      <td>Bourne</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Drafting</td>\n",
       "      <td>39690.0</td>\n",
       "      <td>0.25</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Alicia</td>\n",
       "      <td>Keys</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Music</td>\n",
       "      <td>37118.0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Instr. music</td>\n",
       "      <td>Vocal music</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ada</td>\n",
       "      <td>Lovelace</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>NaN</td>\n",
       "      <td>27515.0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>Computers</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Desus</td>\n",
       "      <td>Nice</td>\n",
       "      <td>Administration</td>\n",
       "      <td>Dean</td>\n",
       "      <td>41431.0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Chien-Shiung</td>\n",
       "      <td>Wu</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Physics</td>\n",
       "      <td>11037.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Science 6-12</td>\n",
       "      <td>Physics</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Chien-Shiung</td>\n",
       "      <td>Wu</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Chemistry</td>\n",
       "      <td>11037.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Science 6-12</td>\n",
       "      <td>Physics</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>James</td>\n",
       "      <td>Joyce</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>English</td>\n",
       "      <td>32994.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>English 6-12</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Hedy</td>\n",
       "      <td>Lamarr</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Science</td>\n",
       "      <td>27919.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Carlos</td>\n",
       "      <td>Boozer</td>\n",
       "      <td>Coach</td>\n",
       "      <td>Basketball</td>\n",
       "      <td>42221.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Young</td>\n",
       "      <td>Boozer</td>\n",
       "      <td>Coach</td>\n",
       "      <td>NaN</td>\n",
       "      <td>34700.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Political sci.</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Micheal</td>\n",
       "      <td>Larsen</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>English</td>\n",
       "      <td>40071.0</td>\n",
       "      <td>0.80</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Vocal music</td>\n",
       "      <td>English</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      First Name Last Name Employee Status     Subject  Hire Date  \\\n",
       "0          Jason    Bourne         Teacher          PE    39690.0   \n",
       "1          Jason    Bourne         Teacher    Drafting    39690.0   \n",
       "2         Alicia      Keys         Teacher       Music    37118.0   \n",
       "3            Ada  Lovelace         Teacher         NaN    27515.0   \n",
       "4          Desus      Nice  Administration        Dean    41431.0   \n",
       "5   Chien-Shiung        Wu         Teacher     Physics    11037.0   \n",
       "6   Chien-Shiung        Wu         Teacher   Chemistry    11037.0   \n",
       "7            NaN       NaN             NaN         NaN        NaN   \n",
       "8          James     Joyce         Teacher     English    32994.0   \n",
       "9           Hedy    Lamarr         Teacher     Science    27919.0   \n",
       "10        Carlos    Boozer           Coach  Basketball    42221.0   \n",
       "11         Young    Boozer           Coach         NaN    34700.0   \n",
       "12       Micheal    Larsen         Teacher     English    40071.0   \n",
       "\n",
       "    % Allocated Full time?  do not edit! ---> Certification Certification.1  \\\n",
       "0          0.75        Yes                NaN   Physical ed         Theater   \n",
       "1          0.25        Yes                NaN   Physical ed         Theater   \n",
       "2          1.00        Yes                NaN  Instr. music     Vocal music   \n",
       "3          1.00        Yes                NaN       PENDING       Computers   \n",
       "4          1.00        Yes                NaN       PENDING             NaN   \n",
       "5          0.50        Yes                NaN  Science 6-12         Physics   \n",
       "6          0.50        Yes                NaN  Science 6-12         Physics   \n",
       "7           NaN        NaN                NaN           NaN             NaN   \n",
       "8          0.50         No                NaN           NaN    English 6-12   \n",
       "9          0.50         No                NaN       PENDING             NaN   \n",
       "10          NaN         No                NaN   Physical ed             NaN   \n",
       "11          NaN         No                NaN           NaN  Political sci.   \n",
       "12         0.80         No                NaN   Vocal music         English   \n",
       "\n",
       "    Certification.2  \n",
       "0               NaN  \n",
       "1               NaN  \n",
       "2               NaN  \n",
       "3               NaN  \n",
       "4               NaN  \n",
       "5               NaN  \n",
       "6               NaN  \n",
       "7               NaN  \n",
       "8               NaN  \n",
       "9               NaN  \n",
       "10              NaN  \n",
       "11              NaN  \n",
       "12              NaN  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(\"dirty_data.xlsx\", engine=\"openpyxl\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleaning Column Names\n",
    "\n",
    "There are a bunch of problems with this data. Firstly, the column names are not lowercase, and they have spaces. This will make it cumbersome to use in a programmatic function. To solve this, we can use the `clean_names()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>employee_status</th>\n",
       "      <th>subject</th>\n",
       "      <th>hire_date</th>\n",
       "      <th>%_allocated</th>\n",
       "      <th>full_time_</th>\n",
       "      <th>do_not_edit!_&gt;</th>\n",
       "      <th>certification</th>\n",
       "      <th>certification_1</th>\n",
       "      <th>certification_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jason</td>\n",
       "      <td>Bourne</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>PE</td>\n",
       "      <td>39690.0</td>\n",
       "      <td>0.75</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jason</td>\n",
       "      <td>Bourne</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Drafting</td>\n",
       "      <td>39690.0</td>\n",
       "      <td>0.25</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name employee_status   subject  hire_date  %_allocated  \\\n",
       "0      Jason    Bourne         Teacher        PE    39690.0         0.75   \n",
       "1      Jason    Bourne         Teacher  Drafting    39690.0         0.25   \n",
       "\n",
       "  full_time_  do_not_edit!_> certification certification_1  certification_2  \n",
       "0        Yes             NaN   Physical ed         Theater              NaN  \n",
       "1        Yes             NaN   Physical ed         Theater              NaN  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_clean = df.clean_names()\n",
    "df_clean.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice now how the column names have been made better."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you squint at the unclean dataset, you'll notice one row and one column of data that are missing. We can also fix this! Building on top of the code block from above, let's now remove those empty columns using the `remove_empty()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>employee_status</th>\n",
       "      <th>subject</th>\n",
       "      <th>hire_date</th>\n",
       "      <th>%_allocated</th>\n",
       "      <th>full_time_</th>\n",
       "      <th>certification</th>\n",
       "      <th>certification_1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Chien-Shiung</td>\n",
       "      <td>Wu</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Physics</td>\n",
       "      <td>11037.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Science 6-12</td>\n",
       "      <td>Physics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Chien-Shiung</td>\n",
       "      <td>Wu</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Chemistry</td>\n",
       "      <td>11037.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Science 6-12</td>\n",
       "      <td>Physics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>James</td>\n",
       "      <td>Joyce</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>English</td>\n",
       "      <td>32994.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>English 6-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Hedy</td>\n",
       "      <td>Lamarr</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Science</td>\n",
       "      <td>27919.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>No</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     first_name last_name employee_status    subject  hire_date  %_allocated  \\\n",
       "5  Chien-Shiung        Wu         Teacher    Physics    11037.0          0.5   \n",
       "6  Chien-Shiung        Wu         Teacher  Chemistry    11037.0          0.5   \n",
       "7         James     Joyce         Teacher    English    32994.0          0.5   \n",
       "8          Hedy    Lamarr         Teacher    Science    27919.0          0.5   \n",
       "\n",
       "  full_time_ certification certification_1  \n",
       "5        Yes  Science 6-12         Physics  \n",
       "6        Yes  Science 6-12         Physics  \n",
       "7         No           NaN    English 6-12  \n",
       "8         No       PENDING             NaN  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_clean = df.clean_names().remove_empty()\n",
    "df_clean.head(9).tail(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now this is starting to shape up well!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Renaming Individual Columns\n",
    "\n",
    "Next, let's rename some of the columns. `%_allocated` and `full_time?` contain non-alphanumeric characters, so they make it a bit harder to use. We can rename them using the :py:meth:`rename_column()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>employee_status</th>\n",
       "      <th>subject</th>\n",
       "      <th>hire_date</th>\n",
       "      <th>percent_allocated</th>\n",
       "      <th>full_time</th>\n",
       "      <th>certification</th>\n",
       "      <th>certification_1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jason</td>\n",
       "      <td>Bourne</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>PE</td>\n",
       "      <td>39690.0</td>\n",
       "      <td>0.75</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jason</td>\n",
       "      <td>Bourne</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Drafting</td>\n",
       "      <td>39690.0</td>\n",
       "      <td>0.25</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Alicia</td>\n",
       "      <td>Keys</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Music</td>\n",
       "      <td>37118.0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Instr. music</td>\n",
       "      <td>Vocal music</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ada</td>\n",
       "      <td>Lovelace</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>NaN</td>\n",
       "      <td>27515.0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>Computers</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Desus</td>\n",
       "      <td>Nice</td>\n",
       "      <td>Administration</td>\n",
       "      <td>Dean</td>\n",
       "      <td>41431.0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name employee_status   subject  hire_date  \\\n",
       "0      Jason    Bourne         Teacher        PE    39690.0   \n",
       "1      Jason    Bourne         Teacher  Drafting    39690.0   \n",
       "2     Alicia      Keys         Teacher     Music    37118.0   \n",
       "3        Ada  Lovelace         Teacher       NaN    27515.0   \n",
       "4      Desus      Nice  Administration      Dean    41431.0   \n",
       "\n",
       "   percent_allocated full_time certification certification_1  \n",
       "0               0.75       Yes   Physical ed         Theater  \n",
       "1               0.25       Yes   Physical ed         Theater  \n",
       "2               1.00       Yes  Instr. music     Vocal music  \n",
       "3               1.00       Yes       PENDING       Computers  \n",
       "4               1.00       Yes       PENDING             NaN  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_clean = (\n",
    "    df.clean_names()\n",
    "    .remove_empty()\n",
    "    .rename_column(\"%_allocated\", \"percent_allocated\")\n",
    "    .rename_column(\"full_time_\", \"full_time\")\n",
    ")\n",
    "\n",
    "df_clean.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how now we have really nice column names! You might be wondering why I'm not modifying the two certification columns -- that is the next thing we'll tackle."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Coalescing Columns\n",
    "\n",
    "If we look more closely at the two `certification` columns, we'll see that they look like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>certification</th>\n",
       "      <th>certification_1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Instr. music</td>\n",
       "      <td>Vocal music</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>PENDING</td>\n",
       "      <td>Computers</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>PENDING</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Science 6-12</td>\n",
       "      <td>Physics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Science 6-12</td>\n",
       "      <td>Physics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NaN</td>\n",
       "      <td>English 6-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>PENDING</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Physical ed</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>NaN</td>\n",
       "      <td>Political sci.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Vocal music</td>\n",
       "      <td>English</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   certification certification_1\n",
       "0    Physical ed         Theater\n",
       "1    Physical ed         Theater\n",
       "2   Instr. music     Vocal music\n",
       "3        PENDING       Computers\n",
       "4        PENDING             NaN\n",
       "5   Science 6-12         Physics\n",
       "6   Science 6-12         Physics\n",
       "7            NaN    English 6-12\n",
       "8        PENDING             NaN\n",
       "9    Physical ed             NaN\n",
       "10           NaN  Political sci.\n",
       "11   Vocal music         English"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_clean[[\"certification\", \"certification_1\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rows 8 and 11 have NaN in the left certification column, but have a value in the right certification column. Let's assume for a moment that the left certification column is intended to record the first certification that a teacher had obtained. In this case, the values in the right certification column on rows 8 and 11 should be moved to the first column. Let's do that with Janitor, using the `coalesce()` method, which does the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>employee_status</th>\n",
       "      <th>subject</th>\n",
       "      <th>hire_date</th>\n",
       "      <th>percent_allocated</th>\n",
       "      <th>full_time</th>\n",
       "      <th>certification</th>\n",
       "      <th>certification_1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jason</td>\n",
       "      <td>Bourne</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>PE</td>\n",
       "      <td>39690.0</td>\n",
       "      <td>0.75</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jason</td>\n",
       "      <td>Bourne</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Drafting</td>\n",
       "      <td>39690.0</td>\n",
       "      <td>0.25</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Alicia</td>\n",
       "      <td>Keys</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Music</td>\n",
       "      <td>37118.0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Instr. music</td>\n",
       "      <td>Vocal music</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ada</td>\n",
       "      <td>Lovelace</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>NaN</td>\n",
       "      <td>27515.0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>Computers</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Desus</td>\n",
       "      <td>Nice</td>\n",
       "      <td>Administration</td>\n",
       "      <td>Dean</td>\n",
       "      <td>41431.0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Chien-Shiung</td>\n",
       "      <td>Wu</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Physics</td>\n",
       "      <td>11037.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Science 6-12</td>\n",
       "      <td>Physics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Chien-Shiung</td>\n",
       "      <td>Wu</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Chemistry</td>\n",
       "      <td>11037.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Science 6-12</td>\n",
       "      <td>Physics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>James</td>\n",
       "      <td>Joyce</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>English</td>\n",
       "      <td>32994.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>No</td>\n",
       "      <td>English 6-12</td>\n",
       "      <td>English 6-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Hedy</td>\n",
       "      <td>Lamarr</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Science</td>\n",
       "      <td>27919.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>No</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Carlos</td>\n",
       "      <td>Boozer</td>\n",
       "      <td>Coach</td>\n",
       "      <td>Basketball</td>\n",
       "      <td>42221.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>No</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Young</td>\n",
       "      <td>Boozer</td>\n",
       "      <td>Coach</td>\n",
       "      <td>NaN</td>\n",
       "      <td>34700.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>No</td>\n",
       "      <td>Political sci.</td>\n",
       "      <td>Political sci.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Micheal</td>\n",
       "      <td>Larsen</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>English</td>\n",
       "      <td>40071.0</td>\n",
       "      <td>0.80</td>\n",
       "      <td>No</td>\n",
       "      <td>Vocal music</td>\n",
       "      <td>English</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      first_name last_name employee_status     subject  hire_date  \\\n",
       "0          Jason    Bourne         Teacher          PE    39690.0   \n",
       "1          Jason    Bourne         Teacher    Drafting    39690.0   \n",
       "2         Alicia      Keys         Teacher       Music    37118.0   \n",
       "3            Ada  Lovelace         Teacher         NaN    27515.0   \n",
       "4          Desus      Nice  Administration        Dean    41431.0   \n",
       "5   Chien-Shiung        Wu         Teacher     Physics    11037.0   \n",
       "6   Chien-Shiung        Wu         Teacher   Chemistry    11037.0   \n",
       "7          James     Joyce         Teacher     English    32994.0   \n",
       "8           Hedy    Lamarr         Teacher     Science    27919.0   \n",
       "9         Carlos    Boozer           Coach  Basketball    42221.0   \n",
       "10         Young    Boozer           Coach         NaN    34700.0   \n",
       "11       Micheal    Larsen         Teacher     English    40071.0   \n",
       "\n",
       "    percent_allocated full_time   certification certification_1  \n",
       "0                0.75       Yes     Physical ed         Theater  \n",
       "1                0.25       Yes     Physical ed         Theater  \n",
       "2                1.00       Yes    Instr. music     Vocal music  \n",
       "3                1.00       Yes         PENDING       Computers  \n",
       "4                1.00       Yes         PENDING             NaN  \n",
       "5                0.50       Yes    Science 6-12         Physics  \n",
       "6                0.50       Yes    Science 6-12         Physics  \n",
       "7                0.50        No    English 6-12    English 6-12  \n",
       "8                0.50        No         PENDING             NaN  \n",
       "9                 NaN        No     Physical ed             NaN  \n",
       "10                NaN        No  Political sci.  Political sci.  \n",
       "11               0.80        No     Vocal music         English  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_clean = (\n",
    "    df.clean_names()\n",
    "    .remove_empty()\n",
    "    .rename_column(\"%_allocated\", \"percent_allocated\")\n",
    "    .rename_column(\"full_time_\", \"full_time\")\n",
    "    .coalesce(\"certification\", \"certification_1\", new_column_name=\"certification\")\n",
    ")\n",
    "\n",
    "df_clean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awesome stuff! Now we don't have two columns of scattered data, we have one column of densely populated data.`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dealing with Excel Dates\n",
    "\n",
    "Finally, notice how the `hire_date` column isn't date formatted. It's got this weird Excel serialization.\n",
    "To clean up this data, we can use the :py:meth:`convert_excel_date` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>employee_status</th>\n",
       "      <th>subject</th>\n",
       "      <th>hire_date</th>\n",
       "      <th>percent_allocated</th>\n",
       "      <th>full_time</th>\n",
       "      <th>certification</th>\n",
       "      <th>certification_1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jason</td>\n",
       "      <td>Bourne</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>PE</td>\n",
       "      <td>2008-08-30</td>\n",
       "      <td>0.75</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jason</td>\n",
       "      <td>Bourne</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Drafting</td>\n",
       "      <td>2008-08-30</td>\n",
       "      <td>0.25</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>Theater</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Alicia</td>\n",
       "      <td>Keys</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Music</td>\n",
       "      <td>2001-08-15</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Instr. music</td>\n",
       "      <td>Vocal music</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ada</td>\n",
       "      <td>Lovelace</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1975-05-01</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>Computers</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Desus</td>\n",
       "      <td>Nice</td>\n",
       "      <td>Administration</td>\n",
       "      <td>Dean</td>\n",
       "      <td>2013-06-06</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Chien-Shiung</td>\n",
       "      <td>Wu</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Physics</td>\n",
       "      <td>1930-03-20</td>\n",
       "      <td>0.50</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Science 6-12</td>\n",
       "      <td>Physics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Chien-Shiung</td>\n",
       "      <td>Wu</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Chemistry</td>\n",
       "      <td>1930-03-20</td>\n",
       "      <td>0.50</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Science 6-12</td>\n",
       "      <td>Physics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>James</td>\n",
       "      <td>Joyce</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>English</td>\n",
       "      <td>1990-05-01</td>\n",
       "      <td>0.50</td>\n",
       "      <td>No</td>\n",
       "      <td>English 6-12</td>\n",
       "      <td>English 6-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Hedy</td>\n",
       "      <td>Lamarr</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>Science</td>\n",
       "      <td>1976-06-08</td>\n",
       "      <td>0.50</td>\n",
       "      <td>No</td>\n",
       "      <td>PENDING</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Carlos</td>\n",
       "      <td>Boozer</td>\n",
       "      <td>Coach</td>\n",
       "      <td>Basketball</td>\n",
       "      <td>2015-08-05</td>\n",
       "      <td>NaN</td>\n",
       "      <td>No</td>\n",
       "      <td>Physical ed</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Young</td>\n",
       "      <td>Boozer</td>\n",
       "      <td>Coach</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1995-01-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>No</td>\n",
       "      <td>Political sci.</td>\n",
       "      <td>Political sci.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Micheal</td>\n",
       "      <td>Larsen</td>\n",
       "      <td>Teacher</td>\n",
       "      <td>English</td>\n",
       "      <td>2009-09-15</td>\n",
       "      <td>0.80</td>\n",
       "      <td>No</td>\n",
       "      <td>Vocal music</td>\n",
       "      <td>English</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      first_name last_name employee_status     subject  hire_date  \\\n",
       "0          Jason    Bourne         Teacher          PE 2008-08-30   \n",
       "1          Jason    Bourne         Teacher    Drafting 2008-08-30   \n",
       "2         Alicia      Keys         Teacher       Music 2001-08-15   \n",
       "3            Ada  Lovelace         Teacher         NaN 1975-05-01   \n",
       "4          Desus      Nice  Administration        Dean 2013-06-06   \n",
       "5   Chien-Shiung        Wu         Teacher     Physics 1930-03-20   \n",
       "6   Chien-Shiung        Wu         Teacher   Chemistry 1930-03-20   \n",
       "7          James     Joyce         Teacher     English 1990-05-01   \n",
       "8           Hedy    Lamarr         Teacher     Science 1976-06-08   \n",
       "9         Carlos    Boozer           Coach  Basketball 2015-08-05   \n",
       "10         Young    Boozer           Coach         NaN 1995-01-01   \n",
       "11       Micheal    Larsen         Teacher     English 2009-09-15   \n",
       "\n",
       "    percent_allocated full_time   certification certification_1  \n",
       "0                0.75       Yes     Physical ed         Theater  \n",
       "1                0.25       Yes     Physical ed         Theater  \n",
       "2                1.00       Yes    Instr. music     Vocal music  \n",
       "3                1.00       Yes         PENDING       Computers  \n",
       "4                1.00       Yes         PENDING             NaN  \n",
       "5                0.50       Yes    Science 6-12         Physics  \n",
       "6                0.50       Yes    Science 6-12         Physics  \n",
       "7                0.50        No    English 6-12    English 6-12  \n",
       "8                0.50        No         PENDING             NaN  \n",
       "9                 NaN        No     Physical ed             NaN  \n",
       "10                NaN        No  Political sci.  Political sci.  \n",
       "11               0.80        No     Vocal music         English  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_clean = (\n",
    "    df.clean_names()\n",
    "    .remove_empty()\n",
    "    .rename_column(\"%_allocated\", \"percent_allocated\")\n",
    "    .rename_column(\"full_time_\", \"full_time\")\n",
    "    .coalesce(\"certification\", \"certification_1\", target_column_name=\"certification\")\n",
    "    .convert_excel_date(\"hire_date\")\n",
    ")\n",
    "df_clean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have a cleaned dataframe!"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "98b0a9b7b4eaaa670588a142fd0a9b87eaafe866f1db4228be72b4211d12040f"
  },
  "kernelspec": {
   "display_name": "Python 3.8.10 64-bit ('base': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
